Finetuning Qwen2.5-3B with Unsloth

Finetuning Qwen2.5-3B with SFT-LoRA using Unsloth on the TinyStories instruction dataset

Finetuning
LORA
Unsloth
Author

Quang T. Duong

Published

August 24, 2024


Large language models (LLMs) are initially trained on vast amounts of unlabeled data to acquire broad general knowledge. However, this pretraining approach has limitations for specialized tasks like question answering: (1) The next-token prediction objective used in pretraining is not directly aligned with targeted tasks like QA. (2) General knowledge may be insufficient for domain-specific applications requiring specialized expertise. (3) Publicly available pretraining data may lack up-to-date or proprietary information needed for certain use cases.

These scenarios are where Supervised Fine-Tuning (SFT) comes into play. It addresses these limitations by adapting pretrained LLMs to specific downstream tasks: (1) enabling models to learn task-specific patterns and nuances, (2) incorporating domain knowledge not present in general pretraining data, and (3) improving performance on targeted applications like QA.

SFT Pipeline Components

The SFT pipeline consists of the following key stages (figure: SFT pipeline flowchart):

  1. Inputs:
  • Raw dataset containing task-specific examples
  • Pretrained base language model
  2. Instruction-Dataset Preparation:
  • Data cleaning and filtering
  • Generating instruction-answer pairs
  3. Dataset Formatting:
  • Converting data to standardized JSON formats (e.g. Alpaca, ShareGPT, OpenAI)
  • Structuring examples using chat templates (e.g. Alpaca, ChatML, Llama 3)
  4. Core SFT Process:
  • Fine-tuning the base model on the formatted instruction dataset
  • Applying SFT techniques like full fine-tuning, LoRA, or QLoRA
  5. Output:
  • Task-specific fine-tuned model

This pipeline allows for systematic adaptation of LLMs to targeted applications while leveraging their pretrained knowledge. The formatted instruction datasets and chat templates provide a unified way to present diverse training examples to the model.
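To make the dataset formatting stage concrete, here is a minimal sketch of one instruction-answer pair expressed in the Alpaca and ShareGPT JSON layouts, then rendered as ChatML text. The sample text and the helper name to_chatml are made up for illustration; real pipelines usually rely on the tokenizer's built-in chat template rather than a hand-written one.

```python
import json

# A made-up instruction-answer pair, used only for illustration.
instruction = "Write a short story about a brave kitten."
answer = "Once upon a time, a little kitten named Mia climbed a tall tree."

# Alpaca-style record: flat fields for instruction, optional input, and output.
alpaca_record = {"instruction": instruction, "input": "", "output": answer}

# ShareGPT-style record: a list of conversation turns with "from"/"value" keys.
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": instruction},
        {"from": "gpt", "value": answer},
    ]
}


def to_chatml(record):
    """Render a ShareGPT-style record as ChatML text (hypothetical helper)."""
    roles = {"human": "user", "gpt": "assistant"}
    turns = [
        f"<|im_start|>{roles[turn['from']]}\n{turn['value']}<|im_end|>"
        for turn in record["conversations"]
    ]
    return "\n".join(turns)


print(json.dumps(alpaca_record, indent=2))
print(to_chatml(sharegpt_record))
```

The JSON format decides how examples are stored on disk, while the chat template decides how they are turned into the single string the model actually sees during training.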

Note that if we fine-tune a pretrained base model, we can choose any data format and chat template. However, if we fine-tune an instruct model, we need to use the same template it was originally trained with.

SFT techniques

There are three main types of Supervised Fine-Tuning (SFT) for large language models:

  1. Full Model Fine-Tuning. This approach updates all parameters of the pre-trained model. It offers maximum flexibility in adapting the model to specialized tasks and often yields significant performance improvements, but it requires substantial computational resources.

  2. Feature-Based Fine-Tuning. This method extracts features from the pre-trained model and uses them as input for another model or classifier. The main pre-trained model remains unchanged. It’s less resource-intensive and provides faster results, making it suitable when computational power is limited.

  3. Parameter-Efficient Fine-Tuning (PEFT). PEFT techniques aim to fine-tune models more efficiently. Only a small set of weights is trained, either a subset of the model’s own weights or newly added task-specific layers or adapters, leaving the fundamental language understanding intact. This significantly reduces computational costs compared to full fine-tuning while still achieving competitive performance.

The choice between these approaches is based on the specific requirements of the task, available computational resources, and desired model performance.

In this article, we will discuss the two most popular and effective PEFT techniques: LoRA and QLoRA.

PEFT with LoRA and QLoRA

LoRA (Low-Rank Adaptation) was introduced in 2021 in the paper LoRA: Low-Rank Adaptation of Large Language Models by Edward Hu et al. and has since gained widespread adoption. It is a cost-effective and efficient method for adapting pretrained language models to specific tasks: the pretrained weights are frozen, and only a small number of task-specific weights, in the form of low-rank adapter matrices added to selected layers, are trained. This reduces the training overhead, making LoRA an attractive solution for limited-compute scenarios.
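The idea can be sketched in a few lines of plain Python. For a frozen weight matrix W, LoRA learns two small matrices A (r x d_in) and B (d_out x r) and computes W x + (alpha / r) * B (A x). The toy sizes, the random values, and the helper names below are chosen only for illustration; this is not how Unsloth or PEFT implement the layer internally.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W is frozen, A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 2  # toy sizes, for illustration only

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, starts at zero

x = [float(i) for i in range(1, d_in + 1)]

# With B initialized to zero, the adapted layer reproduces the pretrained layer.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d_out * d_in        # weights updated by full fine-tuning
lora_params = r * (d_in + d_out)  # weights updated by LoRA
print(f"full fine-tuning: {full_params} weights, LoRA: {lora_params} weights")
```

Because B starts at zero, training begins from exactly the pretrained behavior, and the savings grow with layer size: the adapter cost scales with r * (d_in + d_out) instead of d_in * d_out.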

QLoRA (Quantized Low-Rank Adaptation) is an extension of the LoRA technique, proposed in 2023 in the paper QLoRA: Efficient Finetuning of Quantized LLMs by Tim Dettmers et al. It quantizes the frozen pretrained weights to 4 bits (from the typical 16 or 32 bits) and trains LoRA adapters on top of them. This results in significant memory savings and enables fine-tuning large language models on a single GPU.
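A quick back-of-the-envelope calculation shows where the savings come from. The sketch below estimates the memory needed for the weights alone, using the 3.09 billion parameter count of the model fine-tuned later in this article. It deliberately ignores activations, optimizer states, quantization constants, and the adapter weights, so treat the numbers as rough lower bounds.

```python
NUM_PARAMS = 3.09e9  # parameter count quoted for Qwen2.5-3B in this article


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Going from 32-bit to 4-bit storage shrinks the weights from roughly 12.4 GB to about 1.5 GB, which is the difference between not fitting and fitting comfortably on a free Colab GPU.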

When deciding between LoRA and QLoRA for fine-tuning large language models, key considerations revolve around hardware, model size, speed, and accuracy needs.

LoRA generally requires more GPU memory than QLoRA but is more efficient than full fine-tuning, making it suitable for systems with moderate to high GPU memory capacity. QLoRA, on the other hand, significantly lowers memory demands, making it more suitable for devices with limited memory resources. While LoRA is often faster, QLoRA incurs slight speed trade-offs due to quantization steps but offers superior memory efficiency, enabling fine-tuning of larger models on constrained hardware.

Accuracy and computational efficiency also differ between the two methods. LoRA typically yields stable and precise results, whereas QLoRA’s use of quantization may lead to minor accuracy losses, though it can sometimes reduce overfitting. When it comes to specific needs, LoRA is ideal if preserving full model precision is vital, whereas QLoRA shines for extremely large models or environments with tight memory constraints. QLoRA also supports varying levels of quantization (e.g., 8-bit, 4-bit, or even 2-bit), adding flexibility but at the cost of increased implementation complexity.
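The accuracy trade-off comes from rounding. The sketch below uses a deliberately simplified scheme, uniform absmax quantization of one block of made-up weights, to show how the round-trip error grows as the bit width shrinks. QLoRA itself uses the 4-bit NormalFloat (NF4) data type, which this example does not implement.

```python
def quantize_absmax(values, bits=4):
    """Round floats to signed integer codes using one absmax scale for the block."""
    max_code = 2 ** (bits - 1) - 1  # 7 for 4 bits, 127 for 8 bits
    scale = max(abs(v) for v in values) / max_code or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def max_roundtrip_error(values, bits):
    """Largest absolute difference between the original and dequantized values."""
    codes, scale = quantize_absmax(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(v - w) for v, w in zip(values, restored))


weights = [0.7, -0.3, 0.12, 0.0]  # made-up weights for one block

codes_4bit, scale_4bit = quantize_absmax(weights, bits=4)
print("4-bit codes:", codes_4bit)
print("4-bit max error:", max_roundtrip_error(weights, 4))
print("8-bit max error:", max_roundtrip_error(weights, 8))
```

Each weight is stored as a small integer plus a shared scale per block, so fewer bits mean a coarser grid and a larger worst-case error.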

To implement LoRA and QLoRA in practice, we use the Unsloth framework, an open-source framework designed to speed up the fine-tuning and training of large language models. The next section discusses Unsloth in more detail.

Unsloth

Unsloth is developed by Daniel Han and Michael Han at Unsloth AI. This framework addresses some of the most significant challenges in LLM training, particularly speed and resource efficiency. Let’s check out some of its remarkable features and benefits:

  • Speed Improvements. Unsloth reports training speedups of up to 30 times compared to other advanced methods like Flash Attention 2 (FA2), completing tasks like the Alpaca benchmark in just 3 hours instead of the usual 85. This dramatic reduction in training time allows us to iterate more quickly.

  • Memory Efficiency. It achieves up to a 90% reduction in memory usage compared to FA2.

  • Accuracy Preservation and Enhancement. Despite its focus on speed and efficiency, Unsloth maintains model accuracy, and reports up to a 20% increase in accuracy with its MAX offering.

  • Hardware Flexibility. It is designed to be hardware-agnostic, supporting a wide range of GPUs including those from NVIDIA, AMD, and Intel. This compatibility ensures that users can leverage Unsloth’s benefits regardless of their existing hardware setup.

Use-case

In this article, we illustrate a specific use-case: supervised fine-tuning of the Qwen2.5-3B model using LoRA and QLoRA to build a story generator for children.

For the supervised aspect, we use the instruction dataset TinyStories_Instruction, which contains instruction-story pairs. I prepared this dataset in the previous post; if you have not read it yet, I recommend checking it out. The stories in this dataset are short, synthetically generated by GPT-3.5 and GPT-4 with a limited vocabulary, which makes them highly suitable for our intended 5-year-old readers. The instructions are also created synthetically, using GPT-4o-mini based on the stories.

For the pretrained language model, we use Qwen2.5-3B, which contains 3.09 billion parameters. I chose it for our use-case because of its reasonable size: it is powerful yet suitable for fine-tuning even on resource-constrained platforms like Google Colab.

For the implementation part, we leverage Unsloth for speed and memory efficiency.

Implementation

To achieve the fine-tuning, we will utilize the following libraries and methods:

Step 1: Import Necessary Libraries

import os
import comet_ml
import torch
from trl import SFTTrainer
from datasets import load_dataset, concatenate_datasets
from transformers import TrainingArguments, TextStreamer
from unsloth import FastLanguageModel, is_bfloat16_supported
from google.colab import userdata

Step 2: Comet ML Login

comet_ml.login(project_name="sft-lora-unsloth")

Step 3: Load Pretrained Model and Tokenizer

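max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B",  # the 3B base model this article fine-tunes
    max_seq_length=max_seq_length,
    load_in_4bit=False,  # False: LoRA on 16-bit weights; True: QLoRA on 4-bit weights
    )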

Step 4: Apply LoRA Adaptation

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    )
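To get a feel for what r=32 means in practice, the sketch below counts the trainable LoRA weights for the seven target modules. The layer shapes and layer count are my assumptions about the Qwen2.5-3B architecture, not values stated in this article, so check them against the model's config.json before relying on the result.

```python
# Assumed Qwen2.5-3B dimensions; verify against the model's config.json.
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 2048, 11008, 256, 36

# (in_features, out_features) for each targeted linear layer (assumed shapes).
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds rank * (in_features + out_features) weights."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


r, lora_alpha = 32, 32
trainable = lora_param_count(r)

print(f"scaling factor alpha / r = {lora_alpha / r}")
print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of 3.09B base parameters: {trainable / 3.09e9:.2%}")
```

Under these assumptions the adapters add about 60 million trainable weights, around 2% of the base model. Setting lora_alpha equal to r gives a scaling factor of 1, so the adapter output is added to the frozen layer's output without rescaling.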

Step 5: Formatting Dataset

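Prepare the dataset using the Alpaca text template and map it accordingly. The snippet assumes that dataset has already been loaded (for example with load_dataset) as a single split with instruction and output columns.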

alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token
def format_samples(examples):
    text = []
    for instruction, output in zip(examples["instruction"], examples["output"], strict=False):
        message = alpaca_template.format(instruction, output) + EOS_TOKEN
        text.append(message)

    return {"text": text}

dataset = dataset.map(format_samples, batched=True, remove_columns=dataset.column_names)
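To see what the mapping produces without downloading anything, the sketch below runs the same formatting logic on a tiny made-up batch. The two samples and the EOS placeholder are assumptions for illustration; in the real pipeline the EOS token comes from tokenizer.eos_token.

```python
alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}"""

EOS_TOKEN = "<|endoftext|>"  # placeholder; the real value is tokenizer.eos_token


def format_samples(examples):
    """Turn a batch (dict of column lists) into a dict with one "text" column."""
    text = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        text.append(alpaca_template.format(instruction, output) + EOS_TOKEN)
    return {"text": text}


# A made-up batch in the column-oriented layout that a batched map() passes in.
batch = {
    "instruction": ["Write a story about a kind dog.", "Write a story about the moon."],
    "output": ["Once there was a kind dog named Max.", "One night, the moon smiled at Lily."],
}

formatted = format_samples(batch)
print(formatted["text"][0])
```

Appending the EOS token matters: without it, the model never sees an example of where a story ends and tends to keep generating until it hits the token limit.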

Step 6: Setting Up the Trainer

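Use the SFTTrainer for supervised fine-tuning. The trainer reads dataset["train"] and dataset["test"], so the formatted dataset needs to be split into train and test sets beforehand (for example with train_test_split).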

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=1e-5,
        lr_scheduler_type="linear",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        report_to="comet_ml",
        seed=0,
        ),
    )
trainer.train()
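The batch and schedule settings interact, so it helps to work through the arithmetic. The sketch below computes the effective batch size on a single GPU and evaluates a linear schedule with warmup at a few steps. The number of packed training sequences is a hypothetical value, and the step count is an approximation of what the trainer computes internally.

```python
import math


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


per_device_batch, grad_accum, epochs = 2, 8, 3
num_sequences = 16_000  # hypothetical number of packed training sequences

effective_batch = per_device_batch * grad_accum  # sequences per optimizer step
steps_per_epoch = math.ceil(num_sequences / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"effective batch size: {effective_batch}")
print(f"approximate optimizer steps: {total_steps}")
for step in (0, 5, 10, total_steps // 2, total_steps):
    lr = linear_schedule_lr(step, 1e-5, warmup_steps=10, total_steps=total_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

Gradient accumulation lets a memory-limited GPU train with an effective batch of 16 while only holding 2 sequences at a time, at the cost of 8 forward and backward passes per weight update.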

Step 7: Model Inference

Generate a response using the fine-tuned model.

FastLanguageModel.for_inference(model)
message = alpaca_template.format("Write a story about a humble little bunny \
named Ben who follows a mysterious trail in the woods, \
discovering beautiful flowers, new friends, and a lovely pond along the way.", "")
inputs = tokenizer([message], return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=256, use_cache=True)

Step 8: Save and Push to Hugging Face Hub

from huggingface_hub import login
# Log in to the Hugging Face Hub
login(token=userdata.get('HF_TOKEN'))

model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("tanquangduong/Qwen2.5-3B-Instruct-TinyStories", tokenizer, save_method="merged_16bit")

Inference

Using the fine-tuned model for inference:

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

# Load the fine-tuned, merged model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("tanquangduong/Qwen2.5-3B-Instruct-TinyStories")
model = AutoModelForCausalLM.from_pretrained("tanquangduong/Qwen2.5-3B-Instruct-TinyStories")
model = model.to("cuda")
alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}"""

# Leave the response slot empty so that the model completes it
message = alpaca_template.format("Write a story about a humble little bunny named Ben who follows a mysterious trail in the woods, discovering beautiful flowers, new friends, and a lovely pond along the way.", "")
inputs = tokenizer([message], return_tensors="pt").to("cuda")

# Stream tokens to stdout as they are generated
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048, use_cache=True)
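The streamed output contains the full prompt followed by the story. If only the story is needed, for example to display it in an app, a small post-processing step can strip the prompt. The helper below is a hypothetical sketch that assumes the Alpaca-style template used above and a placeholder EOS token; the decoded string is made up to stand in for the result of decoding the generated tokens.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text, eos_token="<|endoftext|>"):
    """Return the text after the last response marker, without the EOS token."""
    _, _, response = decoded_text.rpartition(RESPONSE_MARKER)
    return response.replace(eos_token, "").strip()


# A made-up decoded generation, standing in for real model output.
decoded = (
    "Below is an instruction that describes a task.\n"
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\nWrite a story about a bunny.\n"
    "### Response:\nBen the bunny hopped along the trail.<|endoftext|>"
)

print(extract_response(decoded))
```

Splitting on the last occurrence of the marker keeps the helper safe even if the instruction itself happens to mention the marker text.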

Conclusion

This guide walked through the supervised fine-tuning process of Qwen2.5-3B using the Unsloth framework and LoRA adapters. Cost-effective methods like LoRA make fine-tuning such models feasible on smaller setups, such as Colab. The end result is a model that can generate customized responses tailored to specific use cases, such as creating Tiny Stories. This approach emphasizes the flexibility and power of modern transformer-based architectures in domain-specific tasks.